Public domain optical character recognition
Abstract
The National Institute of Standards and Technology (NIST) has developed a public domain document processing system. It is a standard reference, form-based handprint recognition system for evaluating optical character recognition (OCR), intended to provide a baseline of performance on an open application. The system’s source code, training data, performance assessment tools, and the types of forms processed are all publicly available. The system recognizes the handprint entered on Handwriting Sample Forms like the ones distributed with NIST Special Database 1. From these forms, it reads handprinted numeric fields, uppercase and lowercase alphabetic fields, and unconstrained text paragraphs composed of words from a limited-size dictionary. The modular design makes the system useful for component evaluation and comparison, training and testing set validation, and multiple-system voting schemes. The system contains a number of significant contributions to OCR technology, including an optimized Probabilistic Neural Network (PNN) classifier that runs 20 times faster than traditional software implementations of the algorithm. The source code is written in C and organized into 11 libraries, totaling approximately 19,000 lines of code and more than 550 subroutines. Source code is provided for form registration, form removal, field isolation, field segmentation, character normalization, feature extraction, character classification, and dictionary-based postprocessing. The recognition system has been successfully compiled and tested on a range of UNIX workstations, including computers manufactured by Digital Equipment Corporation, Hewlett Packard, IBM, Silicon Graphics Incorporated, and Sun Microsystems. This paper gives an overview of the recognition system’s software architecture, including descriptions of the various system components along with timing and accuracy statistics.
Similar resources
Cursive Character Challenge: a New Database for Machine Learning and Pattern Recognition
Cursive character recognition is a challenging task due to high variability and intrinsic ambiguity of cursive letters. This paper presents C-Cube (Cursive Character Challenge), a new public-domain cursive character database. C-Cube contains 57293 cursive characters manually extracted from cursive handwritten words, including both upper and lower case versions of each letter. The database can b...
An Effective and Interactive Training Data Collection Method for Early-Modern Japanese Printed Character Recognition
In this paper, we present a web application that supports efficient collection of training data for early-modern Japanese printed character recognition. The National Diet Library in Japan provides a lot of early-modern (AD 1868-1945) Japanese printed books to the public, but full-text search is essentially impossible. In order to perform advanced search in historical literatures, it is required ex...
Off-line Handwriting Recognition from Forms
A public domain optical character recognition (OCR) system has been developed by the National Institute of Standards and Technology (NIST) to provide a baseline of performance on off-line handwriting recognition from forms. The system’s source code, training data, and performance assessment tools are all publicly available. The system recognizes the handprint written on Handwriting Sample Forms...
Ancient Sri Lankan Inscriptions and Optical Character Recognition
This chapter is written entirely based on the conducted literature survey. This includes the approaches that have been conducted by other researchers. The literature survey for this project was conducted in two directions. A thorough study was done to capture the domain knowledge about the ancient Sri Lankan inscriptions. And on the other hand the technologies that were used for optical cha...
An Efficient Scheme for Invariant Optical Character Recognition Using Triple Correlations
The implementation of an efficient scheme for translation, rotation and scale invariant optical character recognition is presented in this paper. An image representation is used, which is based on appropriate clustering and transformation of the image triple-correlation domain. This representation is one-to-one related to the class of all shifted-rotated-scaled versions of the original image, a...